See, Hear, Explore: Curiosity via Audio-Visual Association
Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model. However, predicting the future is an inherently difficult task which can be ill-posed in the face of stochasticity. In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses. Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration. Our method is inspired by the fact that, for humans, both sight and sound play a critical role in exploration.
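The prediction-error formulation of curiosity mentioned above can be sketched concretely: the intrinsic reward is the error between a learned forward model's prediction of the next state and the state actually observed. A minimal sketch follows, with a random linear map standing in for the learned model; all names and dimensions here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned forward model: maps (state features, action)
# to predicted next-state features. A real agent would train this online.
W = rng.normal(scale=0.1, size=(8 + 2, 8))

def predict_next(features, action):
    """Forward-model prediction of the next state's features."""
    return np.concatenate([features, action]) @ W

def prediction_error_reward(features, action, next_features):
    """Curiosity bonus: squared error between predicted and real future.
    High in unfamiliar states, but also high under irreducible stochasticity,
    which is the failure mode the abstract's ill-posedness remark refers to."""
    return float(np.sum((predict_next(features, action) - next_features) ** 2))

s = rng.normal(size=8)       # current state features (toy)
a = rng.normal(size=2)       # action (toy)
s_next = rng.normal(size=8)  # observed next-state features (toy)
r_int = prediction_error_reward(s, a, s_next)
```

Because the bonus never distinguishes reducible from irreducible error, pure noise sources keep it permanently high, which motivates the association-based alternative.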
Review for NeurIPS paper: See, Hear, Explore: Curiosity via Audio-Visual Association
Weaknesses: My biggest concern with this paper is the treatment of error as reward, or as this paper refers to it, "curiosity by self-supervised prediction." The "couch-potato" issues associated with using error as reward (described in lines 117-121) have been known for decades (e.g., Schmidhuber, 1991, towards the end of Section 3) yet we seem to have to keep re-discovering them. Can you address why it makes sense to use error as reward in your setting despite this problem? It seems particularly concerning since a stated "longer-term goal is to deploy multimodal curiosity on physical robots," a setting with inherent stochasticity. Could you please provide some reasons why you believe that "discovering new sight and sound associations" (lines 122-123) could mitigate the couch-potato problem?
To compute audio features, we take an audio clip spanning 4 time steps (1/15th of a second for these 60-frames-per-second environments) and apply a Fast Fourier Transform (FFT). The FFT output is downsampled using max pooling to a 512-dimensional feature vector, which is used as input to the discriminator along with a 512-dimensional visual feature vector.
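The audio-feature pipeline above can be sketched as follows. The 512-dimensional output and the FFT-then-max-pool structure follow the text; the sample rate, the use of the magnitude spectrum, and the exact pooling-window arithmetic are illustrative assumptions.

```python
import numpy as np

def audio_features(clip, out_dim=512):
    """FFT magnitude spectrum of a short audio clip, max-pooled down to
    a fixed-size feature vector (out_dim follows the text's 512 dims)."""
    spectrum = np.abs(np.fft.rfft(clip))           # magnitude of FFT bins
    # Zero-pad so the spectrum splits evenly into out_dim pooling windows
    # (padding is harmless for max pooling since magnitudes are >= 0).
    window = int(np.ceil(len(spectrum) / out_dim))
    padded = np.pad(spectrum, (0, window * out_dim - len(spectrum)))
    return padded.reshape(out_dim, window).max(axis=1)   # max pooling

# 4 time steps at 60 FPS = 1/15 s of audio; 44.1 kHz is an assumed rate.
sample_rate = 44100
clip = np.random.default_rng(1).normal(size=sample_rate * 4 // 60)
feat = audio_features(clip)
```

The resulting 512-dimensional vector would then be concatenated with the 512-dimensional visual features as the discriminator's input.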